What makes a good role model
This paper is dedicated to Jim Massey on the occasion of his 75th birthday

Abstract 

The role model strategy is introduced as a method for designing an estimator by approaching the output of a 
superior estimator that has better input observations. This strategy is shown to yield the optimal Bayesian estimator 
when a Markov condition is fulfilled. Two examples involving simple channels are given to illustrate its use. The 
strategy is combined with time averaging to construct a statistical model by numerically solving a convex program. 
The role model strategy was developed in the context of low complexity decoder design for iterative decoding. 
Potential applications outside the field of communications are discussed. 

Index Terms 

Bayesian estimation, convex programming, low complexity decoding 

I. Introduction 

Picking a role model is a natural way of coping with challenges by striving to emulate a person we admire; children 
look up to superheroes, musicians emulate Mozart, and information theorists follow Claude E. Shannon. Think of 
it this way: when faced with a difficult choice, solving your problem on your own may seem like a lonely, daunting 
prospect. Instead, you seek refuge behind your role model and try to imagine what {Superman/Shannon/your 
parents/your PhD advisor} would do in your place. But is this a good strategy? Are you better off solving your 
problem on your own rather than doing what you believe your role model would have done? This paper attempts 
to answer this question in a probabilistic, Bayesian setting. The main result of the paper is that, if you pick the 
right role model, the best solution you could obtain on your own is identical to the solution obtained via the "role 
model" strategy, which might be less tedious. 

Despite the philosophical undertone of the previous paragraph, our interest in role models did not originate in 
mere psychological curiosity. The "role model" terminology is a metaphor that we use to add an intuitive perspective
to an otherwise dry statistical question. The motivation for the questions addressed in this paper is the theoretical
framework for the design of post-processing functions in iterative decoding using low complexity components as
developed in [1], [2], [3]. We believe that the results have applications in other fields where statistical modeling
based on derived observations is required and we will show a number of examples to illustrate this. Furthermore, 
knowing that the "role model" strategy is optimal given certain constraints is not only of academic interest: in the 
applications considered, it allows us to replace a complex statistical measurement problem by a convex optimization 
problem with all the convenient baggage of numerical methods that come with it. 

In Section II we will give a mathematical definition of the role model strategy. Section III contains the main
theorem stating when the role model strategy is optimal. Section IV shows two examples for which the strategy
can be used. Section V discusses potential applications.

II. The Role Model Strategy 

Let X, Y and Z be discrete random variables. In "role model" terminology, X is the quantity in which you are 
interested, Y is the knowledge available to your role model, and Z is the knowledge available to yourself. Your 
aim is to use your knowledge Z to make the best possible summary description of X. We assume at this point in 
our analysis that the joint distribution $P_{XYZ}$ is known.

The obvious direct solution of this problem, given the observation $Z = z$, is to compute the a posteriori probability
distribution $P_{X|Z=z}$ for X, which is a description of X that incorporates everything of statistical interest given the
knowledge that $Z = z$. We can write this direct solution as the random variable $S_d(Z)$, which is a function of the
random variable Z and whose value when $Z = z$ is

$$S_d(z) = P_{X|Z=z}. \tag{1}$$

Note that $S_d(Z)$ takes values in the space of probability distributions over X. This solution is optimal from a
Bayesian perspective, and has the added benefit that it is "information lossless" in the sense that

$$I(X; S_d(Z)) = I(X; Z), \tag{2}$$

as was shown in 

Throughout our education, we are taught to look carefully at all the "givens" in a problem. In textbook exercises, 
failure to use one of the "givens" is often a signal that our solution is wrong or sub-optimal. Considered from this 
perspective, the Bayesian a posteriori probability calculation described above is disappointing: the random variable 
Y does not appear anywhere and the solution makes use of $P_{XZ}$ only, although $P_{XYZ}$ is known. This solution is
not specific to the problem and would have been the same if the problem involved only X and Z. 

To make the solution depend on Y, we consider an alternative strategy. If our role model is indeed superior to us 
with respect to the problem X, then we could try to imagine the solution that our role model would have computed 
given Y. Since we do not know the realization $Y = y$, we cannot compute the exact Bayesian solution $P_{X|Y=y}$
that our role model would have computed. We can, however, attempt to produce a distribution $Q_{X|Z=z}$ that gets as
close as possible to $P_{X|Y=y}$ in expectation over all possible realizations y of Y. Here we have written Q instead
of P to stress that this is not generally a true a posteriori probability distribution, but rather an arbitrarily chosen
function of Z taking values in the space of probability distributions over X.

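A good measure of "closeness" when it comes to probability distributions is the Kullback-Leibler divergence.
For realizations $Z = z$ and $Y = y$, the divergence

$$D(P_{X|Y=y} \| Q_{X|Z=z}) = \sum_x P_{X|Y}(x|y) \log_2 \frac{P_{X|Y}(x|y)}{Q_{X|Z}(x|z)} \tag{3}$$

measures how close $Q_{X|Z=z}$ is to $P_{X|Y=y}$, i.e., how close we are to our role model's solution. Averaging the
divergence over the possible realizations of Y gives

$$E\,D(P_{X|Y} \| Q_{X|Z=z}) = \sum_y P_{Y|Z}(y|z)\, D(P_{X|Y=y} \| Q_{X|Z=z}). \tag{4}$$

Averaging the divergence jointly over Y and Z gives

$$E\,D(P_{X|Y} \| Q_{X|Z}) = \sum_y \sum_z P_{YZ}(y,z)\, D(P_{X|Y=y} \| Q_{X|Z=z}). \tag{5}$$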

We can now state in mathematical terms what we mean by "emulating a role model": 

Definition 1 (The "role model" strategy): For every possible value z of Z, choose the probability distribution
$Q_{X|Z=z}$ over X that minimizes $E\,D(P_{X|Y} \| Q_{X|Z=z})$ to be the description $S_{rm}(z)$ of X given the observation
$Z = z$.

Note that since the role model strategy minimizes the expected divergence for every realization z of Z, it also does
so on average. Thus $S_{rm}(Z)$ also minimizes $E\,D(P_{X|Y} \| Q_{X|Z})$. In fact, any $Q_{X|Z}$ that minimizes $E\,D(P_{X|Y} \| Q_{X|Z})$
is identical to $S_{rm}(z)$ for all z for which $P(z) > 0$. Values of z for which $P(z) = 0$ will not make a difference,
so we can replace the minimization for every z by a minimization of the average in our definition.

The variable $Q_{X|Z}$ appears in the denominator inside the logarithm in the definitions of expected divergence (4)
and (5). The logarithm is a concave (convex-∩) function and so is its restriction to the convex set of probability
distributions for X. Since the expected divergence is a negative sum of logarithms of the variables plus a constant, it
is a convex (convex-∪) function. Therefore, implementing the role model strategy requires the solution of a convex
program.
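As an illustration of this convex program, the following minimal Python sketch evaluates the expected divergence (5) for a small hypothetical joint distribution and compares it against a closed-form candidate minimizer. The closed form $Q^*(x|z) = \sum_y P(y|z) P(x|y)$ is not stated in the text above; it follows from writing the objective for a fixed z as a cross-entropy and applying Gibbs' inequality. The numerical values, the function names and the random spot check are assumptions made for illustration only.

```python
import math
import random


def expected_divergence(p_yz, p_x_given_y, q_x_given_z):
    """E D(P_{X|Y} || Q_{X|Z}) in bits, following equation (5).

    p_yz[y][z]        : joint probability of Y = y and Z = z
    p_x_given_y[y][x] : role model posterior P(x|y)
    q_x_given_z[z][x] : candidate description Q(x|z)
    """
    total = 0.0
    for y, row in enumerate(p_yz):
        for z, weight in enumerate(row):
            for x, p in enumerate(p_x_given_y[y]):
                if weight > 0 and p > 0:  # terms with 0 * log 0 are taken as 0
                    total += weight * p * math.log2(p / q_x_given_z[z][x])
    return total


def role_model_solution(p_yz, p_x_given_y):
    """Candidate minimizer Q*(x|z) = sum_y P(y|z) P(x|y); assumes P(z) > 0."""
    n_y, n_z, n_x = len(p_yz), len(p_yz[0]), len(p_x_given_y[0])
    q = []
    for z in range(n_z):
        p_z = sum(p_yz[y][z] for y in range(n_y))
        q.append([sum(p_yz[y][z] * p_x_given_y[y][x] for y in range(n_y)) / p_z
                  for x in range(n_x)])
    return q


# Hypothetical numbers, chosen only for illustration.
p_yz = [[0.3, 0.1], [0.2, 0.4]]           # P(y, z)
p_x_given_y = [[0.9, 0.1], [0.2, 0.8]]    # P(x|y)

q_star = role_model_solution(p_yz, p_x_given_y)
d_star = expected_divergence(p_yz, p_x_given_y, q_star)

# Spot check: no randomly drawn binary Q(x|z) should beat q_star.
rng = random.Random(0)
trials = [(rng.uniform(0.01, 0.99), rng.uniform(0.01, 0.99)) for _ in range(1000)]
worst_gap = min(
    expected_divergence(p_yz, p_x_given_y, [[a, 1 - a], [b, 1 - b]]) - d_star
    for a, b in trials
)
print(q_star, d_star, worst_gap)
```

For larger alphabets or constrained models of $Q_{X|Z}$, a general-purpose convex solver could replace the closed form; the spot check is only a sanity test and not a proof of optimality.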

III. The Role Model Theorem 

We now address the question of how well we do by using the role model strategy defined in the previous section 
and how different the solution $S_{rm}(Z)$ is from the direct Bayesian solution $S_d(Z)$. Intuitively, we suspect that
emulating your role model is a good idea if your role model knows more about the problem X than you do, but 
is a bad idea if you know something about the problem that the role model does not. 

"Knowing more about" a random variable can be formalized with Markov chains. If the joint distribution Pxyz 
is such that

$$P(x|yz) = P(x|y) \tag{6}$$

for all possible realizations x, y and z, then X, Y and Z form a Markov chain, denoted $X - Y - Z$. Information
theoretically, the Markov condition (6) is equivalent to

$$I(X; Z|Y) = 0. \tag{7}$$

In the role model scenario, if X, Y and Z form a Markov chain $X - Y - Z$, then all the knowledge in Z about
X has been learned from Y.

The following theorem confirms our intuition: 

Theorem 1 (The "role model" theorem): If X, Y and Z form a Markov chain $X - Y - Z$, then

$$E\,D(P_{X|Y} \| Q_{X|Z}) = H(X|Z) - H(X|Y) + E\,D(P_{X|Z} \| Q_{X|Z}). \tag{8}$$

In particular,

$$E\,D(P_{X|Y} \| Q_{X|Z}) \ge H(X|Z) - H(X|Y) \tag{9}$$

with equality if and only if $Q_{X|Z=z} = P_{X|Z=z}$ for all z for which $P(z) > 0$.

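Proof: Because X, Y and Z form a Markov chain $X - Y - Z$, Z, Y and X also form a Markov chain $Z - Y - X$.
Thus

$$P(x|yz) = P(x|y) \tag{10}$$

so that

$$P(xyz) = P(yz) P(x|y). \tag{11}$$

Substituting into (5) gives

\begin{align}
E\,D(P_{X|Y} \| Q_{X|Z}) &= \sum_y \sum_z P(yz) \sum_x P(x|y) \log_2 \frac{P(x|y)}{Q(x|z)} \tag{12} \\
&= \sum_x \sum_y \sum_z P(xyz) \log_2 \frac{P(x|y)}{Q(x|z)} \tag{13} \\
&= \sum_x \sum_y \sum_z P(xyz) \log_2 \left[ \frac{P(x|y)}{P(x|z)} \cdot \frac{P(x|z)}{Q(x|z)} \right] \tag{14} \\
&= H(X|Z) - H(X|Y) + E\,D(P_{X|Z} \| Q_{X|Z}). \tag{15}
\end{align}

The inequality (9) now follows from the well-known inequality for expected divergence, cf. [5, 2nd Corollary of
Theorem 2.6.3], namely

$$E\,D(P_{X|Z} \| Q_{X|Z}) \ge 0, \tag{16}$$

with equality if and only if $Q_{X|Z=z} = P_{X|Z=z}$ for every z for which $P(z) > 0$. □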

Theorem 1 shows that the role model strategy is optimal, i.e., $S_{rm}(Z) = S_d(Z)$, when a Markov relation holds
between you, your role model, and the problem you are trying to solve. In other words, if your role model is your
teacher in the sense that you learned everything you know about X from your role model, then you have chosen
an appropriate role model for the problem.

The following theorem treats the general case when X, Y and Z do not necessarily form a Markov chain:

Theorem 2: For any discrete random variables X, Y, and Z,

$$E\,D(P_{X|YZ} \| Q_{X|Z}) \ge H(X|Z) - H(X|YZ) \tag{17}$$

with equality if and only if, for every z for which $P(z) > 0$, $Q_{X|Z=z}$ and $P_{X|Z=z}$ are identical.

Proof: Because

$$I(X; Z|YZ) = 0, \tag{18}$$

it follows that X, (Y, Z) and Z form a Markov chain $X - (Y, Z) - Z$. The theorem is obtained by applying
Theorem 1 to this Markov chain. □

Theorem 2 shows that in order to achieve Bayesian optimality in the general case, $Q_{X|Z}$ must be chosen to
approach $P_{X|YZ}$ in expected divergence, rather than $P_{X|Y}$. When X, Y and Z form a Markov chain, then $P_{X|YZ}$
and $P_{X|Y}$ are identical and the two theorems coincide.
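The identity (8) can be checked numerically. The sketch below is a sanity check under stated assumptions, not part of the proof: it draws a random joint distribution of the form P(y) P(x|y) P(z|y), which satisfies the Markov condition by construction, picks an arbitrary Q(x|z), and compares both sides of (8). The alphabet sizes, the seed and all helper names are hypothetical.

```python
import math
import random


def random_dist(n, rng):
    """A random probability vector of length n with strictly positive entries."""
    w = [rng.random() + 0.05 for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]


def build_markov_joint(n_x, n_y, n_z, rng):
    """Joint P(x, y, z) = P(y) P(x|y) P(z|y), so that X - Y - Z holds."""
    p_y = random_dist(n_y, rng)
    p_x_y = [random_dist(n_x, rng) for _ in range(n_y)]  # p_x_y[y][x] = P(x|y)
    p_z_y = [random_dist(n_z, rng) for _ in range(n_y)]  # p_z_y[y][z] = P(z|y)
    joint = {(x, y, z): p_y[y] * p_x_y[y][x] * p_z_y[y][z]
             for x in range(n_x) for y in range(n_y) for z in range(n_z)}
    return joint, p_x_y


def marginal(joint, keep):
    """Marginalize the joint onto the coordinates listed in keep."""
    out = {}
    for key, p in joint.items():
        k = tuple(key[i] for i in keep)
        out[k] = out.get(k, 0.0) + p
    return out


def cond_entropy_x_given(joint, idx):
    """H(X | V) in bits, where V is the variable at position idx (1 = Y, 2 = Z)."""
    p_xv = marginal(joint, (0, idx))
    p_v = marginal(joint, (idx,))
    return -sum(p * math.log2(p / p_v[(v,)]) for (x, v), p in p_xv.items() if p > 0)


rng = random.Random(1)
n_x, n_y, n_z = 3, 4, 2
joint, p_x_y = build_markov_joint(n_x, n_y, n_z, rng)
q = [random_dist(n_x, rng) for _ in range(n_z)]  # arbitrary Q(x|z), q[z][x]

p_xz = marginal(joint, (0, 2))
p_z = marginal(joint, (2,))

# Left-hand side of (8), using P(yz) P(x|y) = P(xyz) under the Markov condition.
lhs = sum(p * math.log2(p_x_y[y][x] / q[z][x]) for (x, y, z), p in joint.items())

# Right-hand side of (8): H(X|Z) - H(X|Y) + E D(P_{X|Z} || Q_{X|Z}).
ed_xz = sum(p * math.log2((p / p_z[(z,)]) / q[z][x]) for (x, z), p in p_xz.items())
gap = cond_entropy_x_given(joint, 2) - cond_entropy_x_given(joint, 1)
rhs = gap + ed_xz

print(lhs, rhs)
```

The two printed values should agree up to floating-point error, and `lhs` should not fall below `gap`, in line with (9).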

IV. Examples 

We will begin by showing two examples using simple channels for which the Markov condition applies to 
illustrate the use of the role model strategy and theorem. We will discuss potential applications in the next section. 

A. A first example 

Let X, Y and Z be connected through Z-channels in the manner illustrated in Figure 1. The resulting channel
from X to Z is a Z-channel with crossover probability 3/4. Let us assume that the channel is driven with a uniform
probability distribution $P_X(0) = P_X(1) = 1/2$. In this case, the optimal Bayesian a posteriori distribution is

$$P_{X|Z}(0|0) = \frac{P_X(0)\, P_{Z|X}(0|0)}{P_Z(0)} = 4/7, \tag{19}$$

and similarly $P_{X|Z}(1|0) = 3/7$, $P_{X|Z}(1|1) = 1$, and $P_{X|Z}(0|1) = 0$.

Fig. 1. A cascade of two Z-channels

We now apply the role model approach to this scenario. We define the distribution $Q_{X|Z}$. This distribution has
two free parameters that we denote by $q_0 = Q_{X|Z}(0|0)$ and $q_1 = Q_{X|Z}(1|1)$. We assume that $Z = 0$ was received.



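We would like to choose $q_0$ to minimize

\begin{align}
E\,D(P_{X|Y} \| Q_{X|Z=0}) &= P_{Y|Z}(0|0) \left[ P_{X|Y}(0|0) \log_2 \frac{P_{X|Y}(0|0)}{q_0} + P_{X|Y}(1|0) \log_2 \frac{P_{X|Y}(1|0)}{1 - q_0} \right] \nonumber \\
&\quad + P_{Y|Z}(1|0) \left[ P_{X|Y}(0|1) \log_2 \frac{P_{X|Y}(0|1)}{q_0} + P_{X|Y}(1|1) \log_2 \frac{P_{X|Y}(1|1)}{1 - q_0} \right] \nonumber \\
&= \frac{6}{7} \left[ \frac{2}{3} \log_2 \frac{2/3}{q_0} + \frac{1}{3} \log_2 \frac{1/3}{1 - q_0} \right] + \frac{1}{7} \left[ 1 \cdot \log_2 \frac{1}{1 - q_0} \right] \nonumber \\
&= -\frac{6}{7} h(1/3) - \frac{4}{7} \log_2 q_0 - \frac{3}{7} \log_2 (1 - q_0), \tag{20}
\end{align}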



where h(.) denotes the binary entropy function. Taking the derivative of the expression with respect to $q_0$ and setting
it to zero yields $q_0 = 4/7$, which is the same value we obtained using the direct Bayesian approach. Similarly, we
get $q_1 = 1$ by applying the same derivation to the case $Z = 1$.

This example shows that the computation resulting from applying the role model approach is different from the
computation for the direct Bayesian approach, yet both yield the same solution, in line with the role model theorem.
However, so far there appears to be no advantage in replacing the simple computation of (19) by the complicated
derivation of (20). The next example will demonstrate an advantage of the role model technique.
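The arithmetic of this first example can be reproduced with exact fractions. The sketch below assumes that each of the two Z-channels passes a 0 unchanged and flips a 1 to 0 with probability 1/2; the figure is not reproduced in the text, so this value is inferred from the overall crossover probability 3/4 and from the terms 2/3 and 1/3 in (20). All names are hypothetical.

```python
from fractions import Fraction as F

# Uniform input distribution.
p_x = {0: F(1, 2), 1: F(1, 2)}

# Assumed Z-channel, indexed as z_channel[input][output]:
# input 0 is always received as 0, input 1 flips to 0 with probability 1/2.
z_channel = {0: {0: F(1), 1: F(0)}, 1: {0: F(1, 2), 1: F(1, 2)}}

# Joint distribution of the cascade X -> Y -> Z.
joint = {(x, y, z): p_x[x] * z_channel[x][y] * z_channel[y][z]
         for x in (0, 1) for y in (0, 1) for z in (0, 1)}


def total(cond):
    """Sum the joint probability over all (x, y, z) satisfying cond."""
    return sum(p for (x, y, z), p in joint.items() if cond(x, y, z))


# Direct Bayesian solution, as in (19).
p_z0 = total(lambda x, y, z: z == 0)
direct_q0 = total(lambda x, y, z: x == 0 and z == 0) / p_z0


def p_x_given_y(x, y):
    return total(lambda a, b, c: a == x and b == y) / total(lambda a, b, c: b == y)


def p_y_given_z0(y):
    return total(lambda a, b, c: b == y and c == 0) / p_z0


# Coefficients of -log2(q0) and -log2(1 - q0) in the last line of (20).
coef_q = sum(p_y_given_z0(y) * p_x_given_y(0, y) for y in (0, 1))
coef_1mq = sum(p_y_given_z0(y) * p_x_given_y(1, y) for y in (0, 1))

# Setting the derivative -coef_q / q + coef_1mq / (1 - q) to zero gives:
role_model_q0 = coef_q / (coef_q + coef_1mq)

print(direct_q0, role_model_q0, coef_q, coef_1mq)
```

Under the stated channel assumption, both `direct_q0` and `role_model_q0` evaluate to 4/7, and the two coefficients to 4/7 and 3/7, matching (19) and (20).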



B. A second example 

Let us consider the setup in Figure 2, where a binary erasure channel (BEC) with known erasure probability $\delta$ is
followed by an unknown channel with ternary input alphabet and binary output alphabet. The channel is driven with
uniform inputs, i.e., $P_X(0) = P_X(1) = 1/2$. Note that we now depart from our assumption that $P_{XYZ}$ is known.




Fig. 2. Cascade of a binary erasure (BEC) and unknown channel 

In most practical scenarios, only partial knowledge of the joint distribution is available. In a one-shot experiment as
described, it is impossible to obtain a good estimate of X based on Z either with the direct Bayesian approach or with
the role model approach. We will therefore assume that the experiment is repeated independently, so that the channel
is driven with an independent and identically distributed (i.i.d.) random process $X_1, X_2, \ldots$. Since both channels
are memoryless, their outputs are also i.i.d. random processes $Y_1, Y_2, \ldots$ and $Z_1, Z_2, \ldots$. For the direct Bayesian
solution, we can adopt a frequentist approach to estimate the missing probabilities. Let us now concentrate on the role
model approach. There are again two parameters that need to be optimized: $q_0 = Q_{X|Z}(0|0)$ and $q_1 = Q_{X|Z}(1|1)$.
Since we do not know the conditional distribution $P_{Y|Z}$, we cannot, as in the previous example, simply determine
$q_0$ by minimizing $E\,D(P_{X|Y} \| Q_{X|Z=0})$. Let us now consider the definition of the expected divergence (5), which
we rewrite here for convenience:

$$E\,D(P_{X|Y} \| Q_{X|Z}) = \sum_y \sum_z P_{YZ}(y,z)\, D(P_{X|Y=y} \| Q_{X|Z=z}).$$



Note that for any given y and z, we can compute the divergence inside the expectation. We lack the knowledge
about $P_{YZ}$ to be able to compute the expectation. Since the channels are memoryless, all the random processes
involved are i.i.d. and therefore ergodic, and we can apply the law of large numbers, replacing the expectation by
an average over time, i.e.,

$$E\,D(P_{X|Y} \| Q_{X|Z}) = \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} D(P_{X_i|Y_i=y_i} \| Q_{X_i|Z_i=z_i}) \tag{21}$$

with probability 1. Note that this is a mixed frequentist-probabilistic approach, since we are using the true a
posteriori distribution $P_{X|Y}$ and optimizing a probabilistic model $Q_{X|Z}$, but applying the law of large numbers in
order to compute the average. We never actually use measured frequencies as we would for the direct Bayesian
approach.

Equation (21) gives us a method for approximating the expected divergence of interest for any given parameters
$q_0$ and $q_1$ using a time average, but how do we proceed to optimize the choice of parameters in order to minimize
the expected divergence? One solution is to use convex optimization, since the function and its average are convex
in $q_0$ and $q_1$. We can apply any numerical convex optimization technique. Note that the partial derivatives of the
convex function with respect to $q_0$ and $q_1$, which are required for some convex optimization techniques, can also
be computed via time averaging, i.e.,

\begin{align}
\frac{\partial}{\partial q_0} E\,D(P_{X|Y} \| Q_{X|Z}) &= \frac{\partial}{\partial q_0} \sum_y P_{YZ}(y,0)\, D(P_{X|Y=y} \| Q_{X|Z=0}) \nonumber \\
&= \frac{\partial}{\partial q_0} \sum_y P_{YZ}(y,0) \left[ P_{X|Y}(0|y) \log_2 \frac{P_{X|Y}(0|y)}{q_0} + P_{X|Y}(1|y) \log_2 \frac{P_{X|Y}(1|y)}{1 - q_0} \right] \nonumber \\
&= \frac{\partial}{\partial q_0} \sum_y P_{YZ}(y,0) \left[ -P_{X|Y}(0|y) \log_2 q_0 - P_{X|Y}(1|y) \log_2 (1 - q_0) \right] \nonumber \\
&= \frac{1}{\ln 2} \sum_y P_{YZ}(y,0) \left[ \frac{P_{X|Y}(1|y)}{1 - q_0} - \frac{P_{X|Y}(0|y)}{q_0} \right] \nonumber \\
&= \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \frac{1(Z_i = 0)}{\ln 2} \left[ \frac{P_{X_i|Y_i}(1|y_i)}{1 - q_0} - \frac{P_{X_i|Y_i}(0|y_i)}{q_0} \right] \tag{22}
\end{align}

with probability 1, where 1(.) denotes a function that is 1 when the equality holds and zero otherwise.

Fig. 3. Divergence measured through time averaging during the gradient descent optimization

We have tested this approach using a BEC with $\delta = 1/4$ and a channel from Y to Z with parameters $P_{Z|Y}(0|0) = 0.9$,
$P_{Z|Y}(0|\Delta) = 0.7$, and $P_{Z|Y}(0|1) = 0.2$, all unknown to the receiver. All time averages are taken as moving
averages over a window of 100 past observations, and we use a simple gradient search to optimize $q_0$ and $q_1$.
The parameters are initialized as $q_0 = q_1 = 1/2$, and an optimization step is performed for every received symbol
starting from the 101st symbol. Figure 3 shows the evolution of the average divergence throughout the experiment,
and Figure 4 shows the evolution of the parameters $q_0$ and $q_1$. Note that once the divergence reaches its minimum, the
variables have converged to $q_0 = .7234$ and $q_1 = .8182$, which corresponds to the correct a posteriori distribution
$P_{X|Z}$ for the compound channel with the parameters we chose.
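A rough simulation of this experiment can be sketched as follows. It is an illustrative reconstruction, not the code used for the figures: the step size, the run length, the seed, the clipping of the parameters to [0.01, 0.99] and the label "e" for the erasure symbol are all assumptions, since the text does not specify them. The gradient is the moving-window version of (22) for $q_0$, with the analogous expression for $q_1$.

```python
import math
import random
from collections import deque

DELTA = 0.25                                # BEC erasure probability
P_Z0_GIVEN_Y = {0: 0.9, "e": 0.7, 1: 0.2}   # second channel, used only to generate data
P_X0_GIVEN_Y = {0: 1.0, "e": 0.5, 1: 0.0}   # role model posterior P(X = 0 | Y = y)
WINDOW = 100                                # moving-average window from the text
STEP = 0.0005                               # assumed step size
N_SYMBOLS = 50000                           # assumed run length


def simulate(seed=0):
    rng = random.Random(seed)
    q = [0.5, 0.5]            # q[0] = Q(0|0), q[1] = Q(1|1)
    window = deque()          # recent pairs (z_i, P(X = z_i | y_i))
    # hit[z]:  window sum of P(X = z | y_i) over symbols with z_i = z
    # miss[z]: window sum of P(X != z | y_i) over symbols with z_i = z
    hit, miss = [0.0, 0.0], [0.0, 0.0]
    for i in range(N_SYMBOLS):
        x = rng.randrange(2)
        y = "e" if rng.random() < DELTA else x
        z = 0 if rng.random() < P_Z0_GIVEN_Y[y] else 1
        p_match = P_X0_GIVEN_Y[y] if z == 0 else 1.0 - P_X0_GIVEN_Y[y]
        window.append((z, p_match))
        hit[z] += p_match
        miss[z] += 1.0 - p_match
        if len(window) > WINDOW:
            old_z, old_p = window.popleft()
            hit[old_z] -= old_p
            miss[old_z] -= 1.0 - old_p
        if i >= WINDOW:  # first update at the 101st symbol
            for k in (0, 1):
                grad = (miss[k] / (1.0 - q[k]) - hit[k] / q[k]) / (WINDOW * math.log(2))
                q[k] = min(max(q[k] - STEP * grad, 0.01), 0.99)
    return q


q0, q1 = simulate()
print(q0, q1)
```

With these assumed settings the two parameters should settle in the neighbourhood of the values .7234 and .8182 reported above, with residual fluctuations that depend on the step size. The receiver-side update only uses the observations y and z and the known posterior $P_{X|Y}$; the dictionary `P_Z0_GIVEN_Y` enters only through the data generation.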

Note that this second example is a general test case for the use of the role model strategy, since we could use 
exactly the same method for any known discrete channel from X to Y, including for example a finely quantized 
additive white Gaussian noise (AWGN) channel and any unknown discrete memoryless channel from Y to Z. Since 
we have not yet extended the role model framework to continuous random variables, we cannot make any statement 
about its use for a continuous AWGN channel. The number of parameters to optimize will depend only on the 
alphabet size of X and of Z and not on the alphabet size of Y, e.g., when applied to a finely quantized binary-input
AWGN channel, no matter how fine the quantization, we will still need to optimize only two parameters if
the unknown channel has a binary output. For unknown channels with very large output alphabets, it may become
impractical to optimize all the free parameters of the a posteriori distribution as we did in the example. In this case,
a parametric model for $Q_{X|Z}$ can be adopted and the parameters of that model optimized. Note that, depending on
the parametric model chosen, the problem may not remain convex.

Fig. 4. Convergence of the parameters $q_0$ and $q_1$ in a gradient descent method (horizontal axis: processed symbols; optimal values $q_0 = .7234$ and $q_1 = .8182$)

V. Applications 

Markov chains are ubiquitous in all fields where probability theory is used, from physics to social sciences. 
However, not every scenario involving a Markov chain is a potential application for the role model strategy. A 
specific set of circumstances has to be fulfilled for the role model strategy to be useful. 

In most practical applications, the joint distribution $P_{XYZ}$ is unknown, as was the case in our last example, and
it is one of the aims of statistical modeling to "fill the gaps" and infer this distribution or one of its constituents
$P_{X|Z}$ from the observed data.

When the following circumstances are fulfilled, the role model strategy becomes relevant:

• the direct solution $S_d(z) = P_{X|Z=z}$ is hard to obtain. This can have several reasons:

1) the computation of $P_{X|Z=z}$ is too complex; or

2) we have no access to X when measuring the joint statistics of X and Z. We need to construct an estimator
for X "blindly", e.g., without the ability to know or determine X in a training phase of the estimator.
In communications, this is often the case for models that must be adapted within the receiver without
access to the transmitter; or

3) the alphabet of Z is too large to estimate a joint distribution of X and Z accurately using the available
data;

• there exists a good statistical model for the joint distribution of X and Y, allowing us to effectively compute
$P_{X|Y=y}$ for any observation y. This may for example be a physical dependence model of X on Y, or an
existing model based on extensive statistical measurements that we are unable or unwilling to repeat for X
and Z;

• we are granted some access to Y for training purposes. For example:

1) Y is available during an initial training phase; or

2) Y is available intermittently in a "now you see it, now you don't" fashion, and we can use the data points
where both Y and Z are available in order to train an estimator that will work for data points where
only Z is available; or

3) Y is available for a sub-sample of a population of interest. This may be the case for example in applications
where there is a cost associated with producing Y, so we choose to implement a few instances of Y to
train an estimator that will be applied to a much larger set of instances of Z.

There are numerous applications where this set of circumstances is fulfilled. We have successfully applied the
role model strategy to the design of low-complexity decoders for low-density parity check (LDPC) codes, and the
results are documented in [6]. Furthermore, a large number of so-called incomplete data problems correspond to
the setup of Theorem 2 and the role model strategy can be applied to them. Let us consider a few hypothetical
examples:

• zoologists wish to monitor the population of leopards in a national park by observing the leopard scats (a 
scientific word for faeces) found at various sites in the park. They received funding to conduct a DNA analysis 
of the scats over a period of 12 months. The DNA gives additional information that allows for a more precise 
estimation of the number of leopards present at each site. After that period, they must be able to continue 
monitoring the population of leopards without the extra information provided by the costly DNA analysis. This 
can be tackled by the role model approach: the estimator without DNA analysis can be trained to mimic the 
better estimator with DNA analysis over the year for which both are available; 

• an internet search engine must constantly be refined so that its page ranking function remains in touch with user 
expectations and with the evolving internet. For some searches, contextual information about the user's past 
and related searches is available, making it easier to rank search results effectively. The role model approach 
can be applied to refine the ranking without contextual information to mimic the ranking with contextual 
information; 

• a home insurance company computes the risk associated with each contract based on a number of input
parameters, e.g., local crime rate, age of the property, price of the property, etc. When some of this information
is missing, it is common for agents to provide ballpark figures for the missing data. Instead, the role model
approach can be used to refine a risk estimator based on partial data to mimic the estimates computed by the
role model estimator based on the complete data.

Unfortunately, no internet search or insurance company has yet volunteered to let us perform measurements using
their closely guarded data, so we cannot yet report on the performance of the role model strategy in this context
(anyone from these industries who happens to read this paper is cordially invited to contact the author). The
zoology example is based on a real world study [7] and, although the researchers would have been happy to
volunteer their data for statistical testing, in reality they only found a single confirmed leopard, which is too sparse
a sample to constitute a testing ground for the role model strategy. In the future, we will endeavour to find practical
applications outside the field of receiver design where the role model strategy provides benefits.

VI. Conclusion 

We have introduced the "role model" strategy to build one statistical model by mimicking a better statistical 
model, and shown that this strategy is optimal under a Markov condition. We have shown examples of its use 
and discovered that, under certain circumstances, it yields a convex optimization problem that can be solved using 
well-known numerical techniques. In this paper, our aim was to isolate a theoretical result that we derived within 
our work and to show its applicability within a wider context. Its application to the design of low-complexity 
decoders for LDPC codes is described in detail in a separate paper [6].

A final remark concerns this paper's dedication and the "role model" relationship that underlies it: let X be the 
problem of how to write a good paper, Y be the skill and knowledge of the author's PhD advisor Jim Massey, and 
Z be the skill and knowledge of this paper's author. The author did his best to apply the role model strategy while
writing this paper. Whether the channel between Y and Z has non-zero capacity and whether $X - Y - Z$ form
a Markov chain is for the reader to judge. On the other hand, further implications of this relationship could be
investigated: Jim made it no secret during his lectures that his role model is Claude E. Shannon and that he does
his best to emulate him. The problem of a multi-stage role model could be an interesting extension to the results
presented here. Finally, should the associate editor handling this paper assign the review to Jim Massey himself,
we would be in the realm of role models with feedback, another potentially interesting extension.